a sort of Boltzmann distribution. $C$ is a constant fixed by the condition that $\sum_i p_i = 1$,
and $D$ is an as yet undetermined constant.
Suppose that the words are made up of individual letters (symbols) and demarcated
by a special word demarcation symbol (the space in many languages). Cost, length,
and number of letters are all proportional to each other. If the letters can be chosen in
any way from an alphabet of $A$ different ones, by the multiplication rule (Sect. 8.2.1)
there are $A^n$ different $n$-letter words. Let these words now be ranked in order of
increasing cost and call this rank $r$. Since the cost increases linearly with $n$, it only
increases logarithmically with rank,$^{9}$ that is,
$$c_r = \log_A r . \tag{7.5}$$
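The logarithmic growth of cost with rank can be checked by brute force. The following is a minimal sketch, assuming that cost is simply the word length and that words of equal length are ranked alphabetically; the function names and the four-letter alphabet are illustrative choices, not taken from the text. It shows that the length $n$ and $\log_A r$ never differ by more than 1, so Eq. (7.5) holds up to a bounded additive term.

```python
import math
from itertools import product


def rank_words_by_length(alphabet, max_len):
    """Return [(rank, word)] with the shortest (cheapest) words first.

    Assumption: cost equals word length; ties are broken alphabetically.
    """
    words = []
    for n in range(1, max_len + 1):
        # All A**n words of exactly n letters (multiplication rule).
        words.extend("".join(letters) for letters in product(alphabet, repeat=n))
    return list(enumerate(words, start=1))


def length_minus_log_rank(alphabet, max_len):
    """Return [(rank, word, n - log_A(rank))] for every ranked word."""
    A = len(alphabet)
    return [
        (rank, word, len(word) - math.log(rank, A))
        for rank, word in rank_words_by_length(alphabet, max_len)
    ]


if __name__ == "__main__":
    table = length_minus_log_rank("abcd", 4)
    for rank, word, diff in table[::40]:
        print(f"rank={rank:4d}  word={word:5s}  n - log_A r = {diff:+.3f}")
```

The difference stays within a band of width about 1 because all $A^n$ words of a given length share the same cost while their ranks span a factor of roughly $A$.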
Substituting Eq. (7.5) into (7.4), one obtains a power law relation
$$p_r = C r^{-B} , \tag{7.6}$$
known as Zipf’s law when $B = 1$. Mandelbrot has shown that, more precisely, Eq.
(7.6) is
$$p_r = C (r + \rho)^{-B} \tag{7.7}$$
and that the constant $B$ (subsuming $D$ in Eq. 7.4), the reciprocal of the informational
temperature $\theta$ of the distribution (by analogy with the thermodynamic case), can take
values other than 1. For $B > 1$ (i.e., $\theta < 1$), the language is called open (because the
value of $C$ does not greatly depend on the total number of words), whereas for $B < 1$
it does, and the corresponding language is called closed. The constant $\rho$ is connected
with the freedom of choosing words (cf. Sect. ??), but a deep interpretation of its
significance in messages has not yet been given. Equation (7.7) fits the distribution
of written texts remarkably well, and most languages such as English, German,
and so forth are open, whereas highly stylized languages (e.g., modern Hebrew and
the English of the Pennsylvania Dutch) are closed. $\theta$ is a measure of the agility of
exploiting vocabulary; low values are characteristic of children learning a language or
schizophrenic adults; the richest and most imaginative use of vocabulary corresponds
to $\theta = 1$.
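The distinction between open and closed languages can be illustrated numerically. The sketch below, with hypothetical function names and arbitrarily chosen vocabulary sizes, computes the constant $C$ of Eq. (7.7) from the condition $\sum_r p_r = 1$ for a vocabulary of $N$ words. For $B > 1$ the sum converges as $N$ grows, so $C$ settles to a limit; for $B < 1$ the sum keeps growing and $C$ continues to shrink.

```python
def normalization_constant(num_words, B, rho=0.0):
    """Return C such that sum_{r=1}^{N} C * (r + rho)**(-B) equals 1."""
    total = sum((r + rho) ** (-B) for r in range(1, num_words + 1))
    return 1.0 / total


def zipf_mandelbrot(num_words, B, rho=0.0):
    """Return the probabilities p_r = C * (r + rho)**(-B) for r = 1..N."""
    C = normalization_constant(num_words, B, rho)
    return [C * (r + rho) ** (-B) for r in range(1, num_words + 1)]


if __name__ == "__main__":
    # Illustrative values only: B = 1.5 (open) versus B = 0.5 (closed).
    for B in (1.5, 0.5):
        for num_words in (1_000, 10_000, 100_000):
            C = normalization_constant(num_words, B)
            print(f"B={B}  N={num_words:>7}  C={C:.5f}")
```

In the open case $C$ changes by only a few percent when the vocabulary grows a hundredfold, whereas in the closed case it changes substantially over the same range.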
There are many heuristic methods for compression. Dictionaries (i.e., lists of
frequent words) are often used for word texts. In rastered images, successive lines
typically show small changes; large blocks are uniformly black, grey or white, and so
on. A useful way of compressing long sequences of symbols is to search for segments
that are duplicated. The duplicates can then be encoded by the distance of the match
from the original sequence and the length of the matching sequence (number of
symbols). Zipping software typically works on this principle;$^{10}$ the compression is
9 The words are listed in order of increasing cost; rank 1 has the lowest cost and so on.
10 For example, Ziv and Lempel (1977).
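The idea of encoding duplicated segments by distance and length can be sketched as follows. This is a simplified, greedy illustration in the spirit of Ziv and Lempel's scheme, not the implementation used by any particular zipping program; the triple format `(distance, length, next_symbol)`, the function names, and the default window size are assumptions made for the example.

```python
def lz77_compress(data, window=255):
    """Greedy encoding of data into (distance, length, next_symbol) triples."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Search the preceding window for the longest match starting at i.
        for j in range(max(0, i - window), i):
            length = 0
            # Stop one symbol early so a literal next_symbol always remains.
            while i + length < len(data) - 1 and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        tokens.append((best_dist, best_len, data[i + best_len]))
        i += best_len + 1
    return tokens


def lz77_decompress(tokens):
    """Rebuild the original string from (distance, length, next_symbol) triples."""
    out = []
    for dist, length, symbol in tokens:
        start = len(out) - dist
        # Copy symbol by symbol so that overlapping matches work.
        for k in range(length):
            out.append(out[start + k])
        out.append(symbol)
    return "".join(out)


if __name__ == "__main__":
    text = "abracadabra " * 3
    tokens = lz77_compress(text)
    print(len(text), "symbols ->", len(tokens), "triples")
    assert lz77_decompress(tokens) == text
```

The exhaustive window search keeps the sketch short at the expense of speed; practical compressors index the window and entropy-code the resulting triples.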